<html>
<head>
<title>jenny: a pairwise testing tool</title>
</head>
<body bgcolor="#ffffff" text="#000000" link="#000080" 
vlink="#008000" alink="#ff0000">

<h2><center><tt>jenny</tt></center></h2>

<p><tt>jenny</tt> is tool for generating regression tests.  Any time
exhaustive testing looks painful due to the combinatorial explosion of
features interactions to be tested, consider using <tt>jenny</tt>.
It will cover most of the interactions with far fewer testcases.  It
can guarantee pairwise testing of all features that can be used
together, and it can avoid those feature combinations that cannot.

<p>Table of Contents:
<ul>
<li><a href="#code">C code for <tt>jenny</tt></a>
<li><a href="#reason">The reasoning behind <tt>jenny</tt></a>
<li><a href="#usage">Usage</a>
<li><a href="#strategy">Strategy</a>
<li><a href="#example">Using <tt>jenny</tt> to test itself</a>
<li><a href="#algo">Algorithm used</a>
<li><a href="#tools">Pairwise combinatorial testing: aetg, jenny, allpairs</a>
</ul>

<a name="code"><center>
<h3>C code</h3>
</center></a>

<p><center><a href="../c/jenny.c">jenny.c</a> (February 5 2005 version)</center>

<p>Compile it like so: <tt>gcc -O4 jenny.c -o jenny</tt>

<p>Prebuilt Windows executable: <a href="jenny.exe">jenny.exe</a>
(supplied by Derrick Pallas, I think the September 14 2003 version)

<a name="reason"><center>
<h3>The reasoning behind <tt>jenny</tt></h3>
</center></a>

<p><b>Features</b> tend to come in <b>dimensions</b>, where a
<b>testcase</b> can choose one
feature from each dimension.  For example, the datatype of a variable is
a dimension: the variable can have only one of several datatype.  The
variable can be compared to another variable.  The type of the other
variable is another dimension.  The type of comparison (&lt;, &gt;, =) is
another dimension.  Each testcase chooses one feature from each dimension,
and often chooses them independently of each other.

<p><tt>jenny</tt> has <b>exhaustive testing</b> as its limit.  
Exhaustive testing is
to run one testcase for every combination in the cross-product of all
dimensions.  For example, exhaustive testing of 12 dimensions with 2
features each (for example 12 flags that can each be on or off)
requires 2<sup>12</sup> = 4096 testcases, and that's what <tt>jenny</tt>
produces:

<p><center><table><tr><td><pre>
PROMPT&gt;jenny -n12 2 2 2 2 2 2 2 2 2 2 2 2 | wc
   4096   49152  167936
</pre></td></tr></table></center>

(<tt>wc</tt> lists lines, words, characters.  <tt>jenny</tt> produces
one line per testcase.  So, 4096 testcases cover all 12-feature combinations.)

<p>Exhaustive testing is overkill.  The goal of testing is to find
bugs.  I have never seen a bug that only happens when twelve features
are used together.  Most bugs only require one, two, or three features
to be used together.  In our example, testing all twelve-feature
combinations is overkill.  Testing all three-feature combinations
would have been good enough.

<p><tt>jenny</tt> allows the user to ask to cover combinations of
features smaller than the number of dimensions.  At the other limit
from exhaustive testing is combinations of one feature, that is, to
cover every feature at least once.  12 dimension of 2 features each
have 24 features to cover.  One testcase per feature would be 24 testcases.
But behold:

<center><table><tr><td><pre>
PROMPT&gt;jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2 | wc
      2      24      82
</pre></td></tr></table></center>

<tt>jenny</tt> produced
just two testcases. How can that be?  Let's look at the actual
testcases this time:

<center><table><tr><td><pre>
PROMPT&gt;jenny -n1 2 2 2 2 2 2 2 2 2 2 2 2
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a 
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b 
</pre></td></tr></table></center>

Each line is a "testcase", saying which features a single testcase
should cover.  Each testcase contains 12 features, and 12*2 = 24.  Two
testcases cover all features.

<p>Suppose we want to cover all triples of features.  There are (12 choose
3) = 220 ways to choose dimensions, and 2<sup>3</sup> = 8 ways to
choose features within those dimensions, for a total of 1760 possible
triples of features.  That's a quite a bit fewer than the 4096
testcase that exhaustive testing would give.  But <tt>jenny</tt> does
almost two orders of magnitude better than both:

<center><table><tr><td><pre>
PROMPT&gt;jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | wc
     20     240     820
PROMPT&gt;jenny -n3 2 2 2 2 2 2 2 2 2 2 2 2 | sort
 1a 2a 3a 4a 5a 6b 7b 8a 9a 10b 11a 12a 
 1a 2a 3a 4a 5a 6b 7b 8b 9b 10a 11b 12b 
 1a 2a 3a 4a 5b 6b 7a 8b 9b 10b 11b 12b 
 1a 2a 3b 4b 5b 6a 7a 8a 9a 10b 11b 12b 
 1a 2a 3b 4b 5b 6b 7b 8b 9b 10a 11a 12a 
 1a 2b 3a 4b 5a 6a 7b 8a 9b 10a 11a 12b 
 1a 2b 3a 4b 5b 6b 7a 8a 9a 10a 11b 12a 
 1a 2b 3b 4a 5a 6a 7a 8b 9a 10a 11a 12a 
 1a 2b 3b 4a 5a 6b 7b 8b 9a 10a 11b 12b 
 1a 2b 3b 4a 5b 6a 7b 8a 9b 10b 11a 12a 
 1a 2b 3b 4b 5a 6a 7b 8b 9b 10b 11b 12a 
 1b 2a 3a 4a 5b 6a 7b 8b 9a 10a 11b 12b 
 1b 2a 3a 4b 5b 6a 7a 8a 9b 10b 11a 12a 
 1b 2a 3b 4a 5a 6a 7b 8b 9b 10b 11a 12b 
 1b 2a 3b 4a 5a 6b 7a 8a 9b 10a 11b 12a 
 1b 2a 3b 4b 5a 6b 7a 8b 9a 10a 11a 12b 
 1b 2b 3a 4a 5a 6a 7a 8a 9a 10b 11b 12b 
 1b 2b 3a 4b 5b 6b 7b 8b 9a 10b 11a 12a 
 1b 2b 3b 4a 5b 6b 7a 8b 9b 10a 11a 12b 
 1b 2b 3b 4b 5a 6b 7b 8a 9b 10b 11b 12b 
</pre></td></tr></table></center>

<tt>jenny</tt> covers all triples of features in 20 testcases.  Run
down any three columns, if you'd like, and confirm that all 8 possible feature
combinations in them are covered.  Every testcase covers exactly one of
those choices, so 8 testcases is a lower bound for covering all triples of
features.  The minimal number of testcases may be 8, or may be
higher.  (In this case it seems to be 12 testcases that form a <a
href="lexicode.html">Golay code</a>.  I don't know how to generalize
that in any useful way, though.)  <tt>jenny</tt> usually achieves the
minimum number of testcases when -n is 1 or the number of dimensions.
For -n in between it usually does slightly worse than optimal.

<a name="usage"><center>
<h3>Usage</h3>
</center></a>

<p><b><tt>jenny</tt></b> must be given at least n dimensions of
features when you want to cover all n-tuples.  A dimension is represented by
just a number (the number of features in that dimension).  The number of
features in a dimension must be an integer in 2..52.  Features in
dimensions are implicitly named, in order, <b><tt>a b c d e f g h i j
k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S
T U V W X Y Z</tt></b>.  Dimensions are implicitly numbered 1..65535,
in the order that they appear.

<p><b><tt>-n</tt></b> gives the n in n-tuples.  It must be a number
between 1 and 32.  For example, -n2 means pairs and -n3 means triples.
If there are several -n, the last one wins.  The default for n is 2.

<p><b><tt>-s</tt></b> seeds the random number generator.  It must be
an integer in 0..2^^32-1.  The default is 0.  Different seeds produce
different tests.

<p><b><tt>-w</tt></b> gives a restriction, a "without".  -w1a2bc5d means
that the combination of the first dimension's first feature (1a), the
second dimension's second or third feature (2bc), and the fifth dimension's
fourth feature (5d) is disallowed.  The position of withouts is
irrelevant.

<p><b><tt>-o</tt></b> is for old testcases.  For example -ofoo.txt will read
old testcases from the file foo.txt.  It will include those testcases first
in the output, then append new testcases as required.  The set of inputs 
can be terminated by end-of-file or by a line starting with '.' .  The old
testcases should be as they appear in jenny output, with features given for
every vector in order, separated by spaces, with a leading and trailing 
space.  -o is useful for reusing the testcases that you've already written
when you find you need to add a -w or change the size of some dimension in
your jenny command.

<p><b><tt>-h</tt></b> prints out instructions for using <tt>jenny</tt>.

<a name="strategy"><center>
<h3>Strategy</h3>
</center></a>

<p>Here are some heuristics for using <tt>jenny</tt>.
<ul>
<li>You have to implement every testcase jenny recommends.  You can't
skip any.  For example, suppose you do
<center><table><tr><td><pre>
PROMPT&gt;jenny 2 2 2
 1a 2a 3b
 1a 2b 3a
 1b 2a 3a
 1b 2b 3b
</pre></td></tr></table></center>
then afterwards you decide you want to not cover 1b2a.  If you
don't implement 1b2a3a, it's true you won't cover 1b2a.  But you'll
also lose your coverage of pairs 1b3a and 2a3a.  (Yes I faked that
output.)</li>
<li>If you want to not cover certain combinations, use -w.  You can
have as many -w as you want.  Use -w to avoid combinations that raise 
errors or otherwise prevent the other features in the testcase from
being covered. <tt>jenny</tt> assumes that each testcase demonstrates all
tuples in that testcase actually working.  If that assumption is not
satisfied, <tt>jenny</tt>'s guarantee of coverage does not hold.</li>
<li>Use a sed script like <a href="jenny_gen.sed">this one</a> to
translate <tt>jenny</tt> output into something readable.  Arrange your
dimensions so it is as easy as possible to read through the output and
type in the testcase.</li>
<li>A complete test suite needs more than just feature interaction
tests.  It also needs (this list may not be complete) 
<a href="jenny_err.sh">error</a> <a href="jenny_err.log">cases</a>,
unit tests that do code coverage, testcases for bugs found,
and stress tests.  The bugs found in the other tests can give you good
ideas about what features to feed to <tt>jenny</tt>.</li>
<li>You're probably forgetting over half of your dimensions.  As you
write your testcases, make an effort to do them all differently, even
more differently than <tt>jenny</tt>'s specification requires.
<tt>jenny</tt> still proves useful by giving you situations that suggest
interesting things to test.</li>
<li>Very critical, very intertwined, or very large projects may benefit
from covering all triples of features (-n3).  For everything else,
covering all pairs (-n2) is sufficient.</li>
<li>Add as many dimensions as you can.  You get a lot more coverage at
little cost.  The number of testcases (the work you'll do) only grows
with the log of the number of dimensions.  The number of n-tuples
covered (the coverage you'll get) grows with the nth power of the
number of dimensions.
<li>Features don't have to be things in the syntax.  They can be things
the code does.  For example flushing an object to disk due to low
memory can be a feature, even if there's no syntax for it.  Writing
testcases for such features can be tricky, though.</li>
<li>Try to avoid large dimensions.  The number of testcases is
at least the product of the sizes of the n largest dimensions.  Large
dimensions can often be viewed as the product of smaller dimensions: for
example, the 6-feature dimension (=, !=, &lt;, &lt;=, &gt;, &gt;=) can
be viewed as a 3-feature dimension (=, &lt; &gt;) times a 2-feature
dimension (self, not-self).  Beware, factoring a dimension means turning
each feature in it into two features, so it reduces coverage.
Factoring is only appropriate if each feature really is two features,
or if which feature of the dimension is covered isn't very
important.</li>
</ul>

<a name="example"><center>
<h3>Using <tt>jenny</tt> to test itself</h3>
</center></a>

<p>The obvious way to show that <tt>jenny</tt> works in practice is to
use it to test itself.  I'm not sure whether -n2 or -n3 is
appropriate.  We'll be pessimistic and use -n3, that is, we'll cover
all triples of features, and to include as many dimensions as
possible.  So, here goes!</p>

<p>First I came up with all the dimensions of features I could think
of.  I only found 12 (<tt>jenny</tt> is a pretty simple program, and
I'm not a tester by trade).  I
spelled them out in a sed script, <a
href="jenny_gen.sed"><tt>jenny_gen.sed</tt></a>, that can translate
<tt>jenny</tt> output into something short but interpretable.  I spent
some effort rearranging the dimensions so that I could type in
a testcase as I read through the output without too much backtracking.  Each
dimension has three or four settings (each setting is a "feature").</p>
  
<p>Next came the <a href="jenny_gen.sh">command itself</a>:
<center><table><tr><td><pre>
./jenny -n3 4 4 3 3 3 3 3 3 4 3 3 4 -w1abc2d -w1d2abc -w6ab7bc \
  -w6b8c -w6a8bc -w6a9abc -w6a10ab -w11a12abc -w11bc12d -w4c5ab \
  -w1a3a -w1a9a -w3a9c | sed -f jenny_gen.sed > jenny_gen.txt
</pre></td></tr></table></center>
This is the final form.  The original was simpler, but when trying to
write the first two testcases I found many restrictions that had to be
added.  (For example, <tt>-w1a9a</tt> says you can't have n=1 and also
have withouts that restrict fewer than n dimensions, because there is
no such thing as a without with zero dimensions.)

<p>Running that command recommended <a href="jenny_gen.txt">120
testcases</a> that would cover all triples of features.  The first 26
cover all pairs of features.  The first 5 cover every feature
at least once.</p>

<p>For each of <tt>jenny</tt>'s recommendations, I wrote a
testcase by hand that contained all those features (<a
href="jenny_test.sh">jenny_test.sh</a>).  For example, here's the
first <tt>jenny</tt> output, the handwritten testcase, and its results:
<center><table><tr><td><pre>
# test1: (use -n1) (put -n last) (between n and 10 dimensions) (a dimension of size 2)
  (all dimensions equal size) (several restrictions) (some restrictions fully duplicated)
  (a restriction comes after the dimensions it constrains) (one -s) (put -s first) 

./jenny -s9 2 -n8 2 2 2 2 2 2 2 2 -w4b3a -w3a4b -w3a5a4a -w5b4a3a -w7a4b \
 -w8a3b4b -w9a5a6a7a4a -n1

Could not cover tuple  3a 
 1a 2a 3b 4a 5b 6b 7a 8a 9b 
 1b 2b 3b 4b 5a 6a 7b 8b 9a 
</pre></td></tr></table></center>
The testcase I wrote obeys jenny's recommendations.

<p>Once I got rolling it took me about 3 minutes on average to write a
testcase.  Once finished, I had to proofread and correct the
testcases.  The output is about 700K (<a
href="jenny_test.log">jenny_test.log</a>).  On Solaris, I could
produce the log like so:
<center><table><tr><td><pre>
sh -v jenny_test.sh &gt;&amp; jenny_test.log
</pre></td></tr></table></center>
On Windows that didn't work, so I did "<tt>sh -v jenny_test.sh</tt>" in
a shell in Emacs, trimmed the buffer, and saved the results to
jenny_test.log. 

<p>So, did it find any bugs?  I wrote this test after doing some code
coverage and playing with <tt>jenny</tt> for a few weeks, so it was
already stable for all the inputs I could think of.  But
<tt>jenny</tt> still found <a href="jenny_bug.html">7 new bugs</a> in all,
in testcases #2, #10, #11, #11, #13, #13, and #35.  It was clear the
bugs were petering out once I'd covered all pairs of features
(testcases #1 through #26), but I slogged through all triples just to
demonstrate it could be done and to find out whether more bugs would
be found.  Perhaps more complicated software would find covering all
triples worthwhile.</p>

<p>I could have written another program specific to the thing being
tested to automatically translate the output into runnable testcases.
Would that have taken longer than writing the tests by hand?
Probably. Definitely, had I stopped after covering all pairs of
features.  I also added more variability in the testcases by hand than the
dimensions specifically asked for.  That would have been lost
had I written a program to automatically produce runnable testcases.
Writing the testcases by hand also allowed me to use complicated
features.  For example, one features was "do the combined restrictions
imply another implicit restriction?"  That would have been difficult
for an automatic test generator to deal with.  Writing the
testcases by hand appears to be worthwhile.

<a name="algo"><center>
<h3>Algorithm</h3>
</center></a>

<p><tt>jenny</tt> first covers single features, then pairs of features, then
triples, and so forth up to the n-tuples requested by the user.  Each
pass checks whether the existing tests cover all tuples, and if not,
make a list of uncovered tuples and add more tests until all tuples
are covered.  It tries to find testcases that obey the restrictions and
cover a lot of new tuples.  Any tuple that it can't cover no matter
how hard it tries without disobeying some restriction, it says it
can't cover it and adds it to its list of restrictions.  Then it goes
on to one-bigger tuples.  The running time is empirically about 
O(d<sup>2.3</sup>f<sup>4.6</sup>) for d dimensions all of size f (for -n2).

<p>Several improvements are possible.
<ul>
<li><tt>jenny</tt> could read in old testcases with fewer dimensions
and fill in the missing dimensions.

<li><tt>jenny</tt> could walk through all tuples for each
dimension once, caching as many as it can and moving on only once
cached ones are covered, rather than generating all n-tuples at once
and forcing memory to be big enough to hold them.  That would allow
<tt>jenny</tt> to do some amount of optimization even when the total
number of tuples is too big to fit in memory.</li>

<li>If there is enough room in memory, <tt>jenny</tt> could use
augmenting paths to move tuples covered only in the last test into
earlier tests.  The end of the augmenting path would be a change that
doesn't eliminate any existing tuples.  The links would be changes
that displace one existing tuple, which needs to be placed elsewhere.
I've used this technique successfully before, for <a
href="../hash/perfect.html">perfect hashing</a>.</li>

<li><tt>jenny</tt> ought to support nested dimensions.  For example,
the type of comparison is a dimension, the type of lefthand side is a
dimension, and the type of righthand side is a dimension.  But whether
to have a predicate at all is another dimension.  If you choose not to
have a dimension, those nested dimensions (lhs, comparison, rhs) are
ignored.  I can't simply add the whether-or-not dimension because
<tt>jenny</tt> would count the chosen but ignored features toward its
total.  So I could add lots of restrictions, or better yet, change the
interface to support nested dimensions.  The interface should be BNF, 
with something like xpath to specify restrictions.  This one sounds
like a lot of work, so I'll put it off until I learn that someone is
actually using <tt>jenny</tt> and running into this problem.  A
workaround in the meantime is to run <tt>jenny</tt> once at n with all
nested dimensions, then once per set of nested dimensions at n-1 with
one set removed, et cetera, and use the union of all the
testcases.</tt></li>

<li><tt>jenny</tt> ought to have a file-based interface that names all
the features and dimensions directly, instead of using a command line
with numbers and a sed script.  The file-based interface should allow
up to 64K features per dimensions and up to 64K dimensions, bypassing
the limits on how long a command line can be.  Nested dimensions are
complicated enough to almost demand a file-based interface.
<tt>jenny</tt> can keep the current command line interface, but it
ought to have a file-based interface as well.</tt></li> 

<li><tt>jenny</tt> could guarantee different coverage for different
dimensions.  For example, it could guarantee all pairs for some
dimensions, but all triples for others.  I'm not going to do this
because it can already be done: define a dimension as the product of
two smaller dimensions, eg a 24-case dimension becomes a 6-case dimension
times a 4-case dimension.  If you're normally covering triples, you'll
cover just pairs of the split dimension.</li>
</ul>

<a name="tools"><center>
<h3>Tools for covering all n-tuples</h3>
</center></a>

<p>Exhaustive testing grows exponentially with the number of dimensions
to be tested together.  But if you limit yourself to just covering all 
pairs (triples, quadruples, n-tuples) of features, and allow every
testcase to cover many n-tuples, the number of testcases required
grows only logarithmically with the of the number of dimensions.
<tt>jenny</tt> is just one tool that exploits this.  The site <a href=
"http://www.pairwise.org">pairwise.org</a> has a list of many such
tools, and a comparison of the number of testcases each produces for
various inputs.  (So far it doesn't list which ones support
restrictions.  I think these tools are unusable unless they support
restrictions.  Several do, several don't.)

<a name="history"><center>
<h3>History</h3>
</center></a>
<p><ul>
<li><a href="../c/jenny_040205.c">February 5 2005</a>.  Renamed rand to
flearand and #included stdlib.h.  No change to semantics or perforance.</li>
<li><a href="../c/jenny_040302.c">March 2 2004</a>.  Fix a bug in -o.  Add
a sanity check at the end that every tuple is covered either by a
without or by a testcase.</li>
<li><a href="../c/jenny_030914.c">September 14 2003.</a>  Added -o.</li>
<li><a href="../c/jenny_030804.c">August 4 2003.</a>  Self-tested.</li>
</ul></p>

<hr line=1>

<p><a href="../hash/perfect.html">A minimal perfect hash generator</a>
<br><a href="../physics/kempler.html">Klemperer Rosettes and other odd orbits</a>
<br><a href="../scout/index.html">Ye Olde Catalogue of Boy Scout Skits</a>
<br><a href="../other/recipe.html">Chocolate Chicken and other recipes</a>
<br><a href="../index.html">Table of Contents</a>
</body>
</html>

